
The Motor Vehicle Collisions crash table contains details on the crash event; each row represents one crash. The Motor Vehicle Collisions data tables contain information from all police-reported motor vehicle collisions in NYC. The police report (MV104-AN) must be filled out for collisions where someone is injured or killed, or where there is at least $1,000 worth of damage (https://www.nhtsa.gov/sites/nhtsa.dot.gov/files/documents/ny_overlay_mv-104an_rev05_2004.pdf). Note that the data is preliminary and subject to change when the MV-104AN forms are amended based on revised crash details. For the most accurate, up-to-date statistics on traffic fatalities, please refer to the NYPD Motor Vehicle Collisions page (updated weekly) or Vision Zero View (updated monthly).
*Figure: 6 Step Machine Learning Modelling Framework*
More specifically, we'll look at the following topics.
To work through these topics, we'll use pandas, Matplotlib, and NumPy for data analysis, as well as Scikit-Learn for machine learning and modelling tasks.
*Figure: Tools which can be used for each step of the machine learning modelling process.*

There really is no particular problem we're trying to solve. We're just analyzing data and looking for insights that we can use to draw conclusions.

import time
import warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx

import plotly.offline as py
import plotly.graph_objs as go
import plotly.tools as tls
from plotly.offline import init_notebook_mode, iplot

init_notebook_mode(connected=True)
warnings.filterwarnings('ignore')
What you'll want to do here is dive into the data your problem definition is based on. This may involve sourcing the data, defining the different parameters, talking to experts about it, and finding out what you should expect.
We'll look for attributes (also called features) and variables that we can potentially use to predict our target variable.
Attributes and features are also referred to as independent variables and a target variable can be referred to as a dependent variable.
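As a minimal illustration of this split (with made-up column names, not the collisions schema), separating features from a target in pandas looks like:

```python
import pandas as pd

# Hypothetical toy frame: "speed" and "hour" stand in for independent
# variables (features); "injured" stands in for the dependent (target) variable.
data = pd.DataFrame({
    "speed": [30, 45, 60],
    "hour": [8, 17, 23],
    "injured": [0, 1, 2],
})

X = data.drop(columns=["injured"])  # features / independent variables
y = data["injured"]                 # target / dependent variable
print(X.columns.tolist(), y.tolist())
```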
df = pd.read_csv('Datasets/Motor_Vehicle_Collisions_-_Crashes.csv')
# working copies, so each analysis section below can transform the data independently
collisionsDF = df.copy()
df_copy = df.copy()
df_copy3 = df.copy()
all_dataC = df.copy()
df.head().T
| 0 | 1 | 2 | 3 | 4 | |
|---|---|---|---|---|---|
| CRASH DATE | 08/18/2020 | 09/25/2020 | 09/05/2020 | 09/30/2020 | 08/19/2020 |
| CRASH TIME | 12:00 | 0:01 | 19:25 | 14:00 | 12:36 |
| BOROUGH | BROOKLYN | NaN | STATEN ISLAND | BROOKLYN | BRONX |
| ZIP CODE | 11214 | NaN | 10305 | 11235 | 10472 |
| LATITUDE | 40.601 | NaN | 40.6044 | 40.5864 | 40.8307 |
| LONGITUDE | -73.9935 | NaN | -74.069 | -73.9479 | -73.8671 |
| LOCATION | (40.60101, -73.99352) | NaN | (40.604397, -74.069) | (40.586376, -73.9479) | (40.830738, -73.86707) |
| ON STREET NAME | NaN | TRIBOROUGH BRIDGE | FINGERBOARD ROAD | VOORHIES AVENUE | NaN |
| CROSS STREET NAME | NaN | NaN | NARROWS ROAD NORTH | EAST 21 STREET | NaN |
| OFF STREET NAME | 15 BAY 31 STREET | NaN | NaN | NaN | 1231 SAINT LAWRENCE AVENUE |
| NUMBER OF PERSONS INJURED | 0 | 0 | 0 | 2 | 0 |
| NUMBER OF PERSONS KILLED | 0 | 0 | 0 | 0 | 0 |
| NUMBER OF PEDESTRIANS INJURED | 0 | 0 | 0 | 0 | 0 |
| NUMBER OF PEDESTRIANS KILLED | 0 | 0 | 0 | 0 | 0 |
| NUMBER OF CYCLIST INJURED | 0 | 0 | 0 | 0 | 0 |
| NUMBER OF CYCLIST KILLED | 0 | 0 | 0 | 0 | 0 |
| NUMBER OF MOTORIST INJURED | 0 | 0 | 0 | 2 | 0 |
| NUMBER OF MOTORIST KILLED | 0 | 0 | 0 | 0 | 0 |
| CONTRIBUTING FACTOR VEHICLE 1 | Backing Unsafely | Other Vehicular | Driver Inattention/Distraction | Driver Inattention/Distraction | Unspecified |
| CONTRIBUTING FACTOR VEHICLE 2 | Unspecified | NaN | Unspecified | Unspecified | Unspecified |
| CONTRIBUTING FACTOR VEHICLE 3 | NaN | NaN | Unspecified | NaN | NaN |
| CONTRIBUTING FACTOR VEHICLE 4 | NaN | NaN | NaN | NaN | NaN |
| CONTRIBUTING FACTOR VEHICLE 5 | NaN | NaN | NaN | NaN | NaN |
| COLLISION_ID | 4342387 | 4351600 | 4346201 | 4353455 | 4340085 |
| VEHICLE TYPE CODE 1 | Station Wagon/Sport Utility Vehicle | Station Wagon/Sport Utility Vehicle | Sedan | Pick-up Truck | Station Wagon/Sport Utility Vehicle |
| VEHICLE TYPE CODE 2 | Station Wagon/Sport Utility Vehicle | NaN | Sedan | Motorscooter | Station Wagon/Sport Utility Vehicle |
| VEHICLE TYPE CODE 3 | NaN | NaN | Sedan | NaN | NaN |
| VEHICLE TYPE CODE 4 | NaN | NaN | NaN | NaN | NaN |
| VEHICLE TYPE CODE 5 | NaN | NaN | NaN | NaN | NaN |
df.shape
(1734272, 29)
df.describe()
df.info()
df.columns
Index(['CRASH DATE', 'CRASH TIME', 'BOROUGH', 'ZIP CODE', 'LATITUDE',
'LONGITUDE', 'LOCATION', 'ON STREET NAME', 'CROSS STREET NAME',
'OFF STREET NAME', 'NUMBER OF PERSONS INJURED',
'NUMBER OF PERSONS KILLED', 'NUMBER OF PEDESTRIANS INJURED',
'NUMBER OF PEDESTRIANS KILLED', 'NUMBER OF CYCLIST INJURED',
'NUMBER OF CYCLIST KILLED', 'NUMBER OF MOTORIST INJURED',
'NUMBER OF MOTORIST KILLED', 'CONTRIBUTING FACTOR VEHICLE 1',
'CONTRIBUTING FACTOR VEHICLE 2', 'CONTRIBUTING FACTOR VEHICLE 3',
'CONTRIBUTING FACTOR VEHICLE 4', 'CONTRIBUTING FACTOR VEHICLE 5',
'COLLISION_ID', 'VEHICLE TYPE CODE 1', 'VEHICLE TYPE CODE 2',
'VEHICLE TYPE CODE 3', 'VEHICLE TYPE CODE 4', 'VEHICLE TYPE CODE 5'],
dtype='object')
#Number of accidents per Borough
df.groupby('BOROUGH').size()
BOROUGH
BRONX            171786
BROOKLYN         376159
MANHATTAN        281857
QUEENS           321879
STATEN ISLAND     51026
dtype: int64
df['ZIP CODE'].isna().sum()
531775
# check missing values
df.isnull().any()
CRASH DATE                       False
CRASH TIME                       False
BOROUGH                           True
ZIP CODE                          True
LATITUDE                          True
LONGITUDE                         True
LOCATION                          True
ON STREET NAME                    True
CROSS STREET NAME                 True
OFF STREET NAME                   True
NUMBER OF PERSONS INJURED         True
NUMBER OF PERSONS KILLED          True
NUMBER OF PEDESTRIANS INJURED    False
NUMBER OF PEDESTRIANS KILLED     False
NUMBER OF CYCLIST INJURED        False
NUMBER OF CYCLIST KILLED         False
NUMBER OF MOTORIST INJURED       False
NUMBER OF MOTORIST KILLED        False
CONTRIBUTING FACTOR VEHICLE 1     True
CONTRIBUTING FACTOR VEHICLE 2     True
CONTRIBUTING FACTOR VEHICLE 3     True
CONTRIBUTING FACTOR VEHICLE 4     True
CONTRIBUTING FACTOR VEHICLE 5     True
COLLISION_ID                     False
VEHICLE TYPE CODE 1               True
VEHICLE TYPE CODE 2               True
VEHICLE TYPE CODE 3               True
VEHICLE TYPE CODE 4               True
VEHICLE TYPE CODE 5               True
dtype: bool
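Beyond a True/False flag per column, the *share* of missing values is often more informative when deciding how to handle them. A minimal sketch on toy data (the `toy` frame is illustrative, not the collisions dataset):

```python
import pandas as pd

# Toy frame with deliberate gaps, standing in for the real columns
toy = pd.DataFrame({
    "BOROUGH": ["BROOKLYN", None, "QUEENS", None],
    "ZIP CODE": [11214, None, 11101, 11375],
})

# isna() gives a boolean mask; mean() turns it into a fraction per column
missing_pct = toy.isna().mean().mul(100).round(1)
print(missing_pct)  # BOROUGH 50.0, ZIP CODE 25.0
```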
print(df[["BOROUGH","NUMBER OF PERSONS INJURED"]].groupby("BOROUGH").count())
               NUMBER OF PERSONS INJURED
BOROUGH
BRONX                             171785
BROOKLYN                          376155
MANHATTAN                         281854
QUEENS                            321876
STATEN ISLAND                      51026
#NUMBER OF PERSONS Killed per BOROUGH
print(df[["BOROUGH","NUMBER OF PERSONS KILLED"]].groupby("BOROUGH").count())
print(df[["BOROUGH","NUMBER OF PEDESTRIANS KILLED"]].groupby("BOROUGH").count())
NUMBER OF PERSONS KILLED
BOROUGH
BRONX 171779
BROOKLYN 376154
MANHATTAN 281851
QUEENS 321875
STATEN ISLAND 51025
NUMBER OF PEDESTRIANS KILLED
BOROUGH
BRONX 171786
BROOKLYN 376159
MANHATTAN 281857
QUEENS 321879
STATEN ISLAND 51026
#per-borough counts of non-null records for each casualty column (note: count(), not sum()), then rearranging the columns
persk_data=df[["BOROUGH","NUMBER OF PERSONS KILLED"]].groupby("BOROUGH").count()
persk_data['BOROUGH'] = persk_data.index
persk_data=persk_data.reset_index(drop=True)
pedk_data=df[["BOROUGH","NUMBER OF PEDESTRIANS KILLED"]].groupby("BOROUGH").count()
pedk_data['BOROUGH'] = pedk_data.index
pedk_data=pedk_data.reset_index(drop=True)
motk_data=df[["BOROUGH","NUMBER OF MOTORIST KILLED"]].groupby("BOROUGH").count()
motk_data['BOROUGH'] = motk_data.index
motk_data=motk_data.reset_index(drop=True)
s1 = persk_data.set_index('BOROUGH')['NUMBER OF PERSONS KILLED']
s2 = pedk_data.set_index('BOROUGH')['NUMBER OF PEDESTRIANS KILLED']
s3 = motk_data.set_index('BOROUGH')['NUMBER OF MOTORIST KILLED']
df1 = (s1+s2+s3).reset_index(name='Total People Killed')
print (df1)
         BOROUGH  Total People Killed
0          BRONX               515351
1       BROOKLYN              1128472
2      MANHATTAN               845565
3         QUEENS               965633
4  STATEN ISLAND               153077
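A caveat worth flagging: `.count()` tallies non-null records rather than summing the casualty columns, so the "Total People Killed" figures above are really sums of record counts, not fatalities. A minimal sketch on toy data of the difference between `count()` and `sum()`:

```python
import pandas as pd

# Toy frame: two Bronx crashes (1 death total), one Queens crash (2 deaths)
toy = pd.DataFrame({
    "BOROUGH": ["BRONX", "BRONX", "QUEENS"],
    "NUMBER OF PERSONS KILLED": [1, 0, 2],
})

# count() -> number of non-null rows; sum() -> actual fatalities
counts = toy.groupby("BOROUGH")["NUMBER OF PERSONS KILLED"].count()
totals = toy.groupby("BOROUGH")["NUMBER OF PERSONS KILLED"].sum()
print(counts.to_dict())  # {'BRONX': 2, 'QUEENS': 1}
print(totals.to_dict())  # {'BRONX': 1, 'QUEENS': 2}
```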
import matplotlib.pyplot as plt
plt.rcdefaults()
fig, ax = plt.subplots()
y_pos= np.arange(df1.shape[0])
y=df1["Total People Killed"].values
names=df1["BOROUGH"].values
ax.barh(y_pos, y, align='center',color='green', ecolor='black')
ax.set_yticks(y_pos)
ax.set_yticklabels(names)
ax.invert_yaxis() # labels read top-to-bottom
ax.set_xlabel('People Killed in Accident')
ax.set_title('Number of People Killed Per Borough')
for i, v in enumerate(y):
    ax.text(v + 3, i + .25, str(v), color='Green', fontweight='bold')
plt.show()

df_copy['DATE'] = pd.to_datetime(df_copy['CRASH DATE'], format="%m/%d/%Y")
day_data = df_copy\
.filter(items=['DATE','NUMBER OF PERSONS INJURED','NUMBER OF PERSONS KILLED'])\
.groupby([df_copy.DATE.dt.year, df_copy.DATE.dt.month, df_copy.DATE.dt.day])\
.sum()
day_data['date'] = day_data.index
day_data['date'] = pd.to_datetime(day_data['date'], format="(%Y, %m, %d)")
day_data = day_data.reset_index(drop=True)
day_data=day_data.tail(100)
# # import plotly
# import plotly.plotly as py
# import plotly.graph_objs as go
# # these two lines are what allow your code to show up in a notebook!
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode()
# making a chart
trace1 = go.Scatter(
    x=day_data.date,  # x axis: date
    y=day_data['NUMBER OF PERSONS INJURED'],
    name='INJURED'
)
trace2 = go.Scatter(
    x=day_data.date,
    y=day_data['NUMBER OF PERSONS KILLED'],
    name='KILLED'
)
plotted_data = [trace1, trace2]
layout = dict(title="Number of persons injured/killed per month",
              xaxis=dict(title='Date', ticklen=10, zeroline=False))
fig = dict(data=plotted_data, layout=layout)
iplot(fig)

#delete empty rows
df=df.dropna(how='all')
#reset index to remove missing indexes
df=df.reset_index(drop=True)
#split date from "DATE" column
df['Date']=df['CRASH DATE'].apply(lambda x: x[:10])
# parse dates
for i in range(len(df['Date'])):
    if df['Date'][i].startswith('00'):
        p = df['Date'][i-1].split('-')
        current = df['Date'][i].split('-')
        current[0] = p[0]
        df['Date'][i] = "-".join(current)
df['Date'] = pd.to_datetime(df['Date'], format="%m/%d/%Y")
df['Date'][:5]
0   2020-08-18
1   2020-09-25
2   2020-09-05
3   2020-09-30
4   2020-08-19
Name: Date, dtype: datetime64[ns]
# count of accidents per month
acc_per_month = df['Date'].groupby([df.Date.dt.year, df.Date.dt.month]).agg('count')
# convert to dataframe
acc_per_month = acc_per_month.to_frame()
# move date month from index to column
acc_per_month['date'] = acc_per_month.index
# rename column
acc_per_month = acc_per_month.rename(columns={acc_per_month.columns[0]:"acc"})
# re-parse dates
acc_per_month['date'] = pd.to_datetime(acc_per_month['date'], format="(%Y, %m)")
# remove index
acc_per_month = acc_per_month.reset_index(drop=True)
# get month of meet
acc_per_month['month'] = acc_per_month.date.dt.month
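As a side note, the tuple-index round trip above (groupby on year/month, then re-parsing `"(%Y, %m)"` strings) can be avoided: pandas can aggregate per month directly with `resample`. A minimal sketch on toy data (the `toy` frame and its dates are illustrative):

```python
import pandas as pd

# Toy frame: two August crashes, one September crash
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-08-18", "2020-08-19", "2020-09-05"]),
})

# resample('MS') buckets rows by month (month-start labels); size() counts them
acc = toy.set_index("Date").resample("MS").size()
print(acc.tolist())  # [2, 1]
```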
# these two lines are what allow code to show up in a notebook!
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode()
# date on the x axis and accidents on the y axis
dataPlot2 = [go.Scatter(x=acc_per_month.date, y=acc_per_month.acc)]
# the layout of figure
layout = dict(title="Number of Accidents per Month",
              xaxis=dict(title='Date', ticklen=2, zeroline=False))
# create and show figure
fig = dict(data = dataPlot2, layout = layout)
iplot(fig)

# def getAccidentsAndPlot(borough):
# dfs = pd.read_csv('Datasets/Motor_Vehicle_Collisions_-_Crashes.csv')
# acc_per_day = pd.DataFrame()
# #Number of accidents per day per in Brooklyn
# dfs = dfs[dfs["BOROUGH"]==borough]
# # count of accidents per day in Brooklyn
# acc_per_day = dfs['CRASH DATE'].groupby([dfs['CRASH DATE'].dt.year, dfs['CRASH DATE'].dt.month,dfs['CRASH DATE'].dt.day]).agg('count')
# # convert to dataframe
# acc_per_day = acc_per_day.to_frame()
# # move date day from index to column
# acc_per_day['date'] = acc_per_day.index
# # rename column
# acc_per_day = acc_per_day.rename(columns={acc_per_day.columns[0]:"acc"})
# # re-parse dates
# acc_per_day['date'] = pd.to_datetime(acc_per_day['date'], format="(%Y, %m, %d)")
# # remove index
# acc_per_day = acc_per_day.reset_index(drop=True)
# # get day of meet
# acc_per_day['day'] = acc_per_day.date.dt.day
# # date on the x axis and accidents on the y axis
# dataPlot3 = [go.Scatter(x=acc_per_month.date, y=acc_per_month.acc)]
# # the layout of figure
# layout = dict(title = "Number of Accidents per Day",
# xaxis= dict(title= borough,ticklen= 2,zeroline= False))
# # create and show figure
# fig = dict(data = dataPlot3, layout = layout)
# iplot(fig)
# these two lines are what allow code to show up in a notebook!
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode()
dataPlot = [go.Histogram(histfunc = "sum",x=names,y=y)]
# the layout of figure
layout = dict(title="Number of Deaths Per Borough",
              xaxis=dict(title='Borough', ticklen=2, zeroline=False))
# create and show figure
fig = dict(data = dataPlot, layout = layout)
iplot(fig)

Let's focus on Brooklyn, since it has the largest number of deaths in New York.
#Number of Accidents Per Day in Queens
# getAccidentsAndPlot("QUEENS")
all_dataC.head(3)
| CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | CROSS STREET NAME | OFF STREET NAME | ... | CONTRIBUTING FACTOR VEHICLE 2 | CONTRIBUTING FACTOR VEHICLE 3 | CONTRIBUTING FACTOR VEHICLE 4 | CONTRIBUTING FACTOR VEHICLE 5 | COLLISION_ID | VEHICLE TYPE CODE 1 | VEHICLE TYPE CODE 2 | VEHICLE TYPE CODE 3 | VEHICLE TYPE CODE 4 | VEHICLE TYPE CODE 5 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 08/18/2020 | 12:00 | BROOKLYN | 11214 | 40.601010 | -73.99352 | (40.60101, -73.99352) | NaN | NaN | 15 BAY 31 STREET | ... | Unspecified | NaN | NaN | NaN | 4342387 | Station Wagon/Sport Utility Vehicle | Station Wagon/Sport Utility Vehicle | NaN | NaN | NaN |
| 1 | 09/25/2020 | 0:01 | NaN | NaN | NaN | NaN | NaN | TRIBOROUGH BRIDGE | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4351600 | Station Wagon/Sport Utility Vehicle | NaN | NaN | NaN | NaN |
| 2 | 09/05/2020 | 19:25 | STATEN ISLAND | 10305 | 40.604397 | -74.06900 | (40.604397, -74.069) | FINGERBOARD ROAD | NARROWS ROAD NORTH | NaN | ... | Unspecified | Unspecified | NaN | NaN | 4346201 | Sedan | Sedan | Sedan | NaN | NaN |
3 rows × 29 columns
all_dataC.isnull().any()
year                         False
BOROUGH                       True
collision_factor             False
COLLISION_ID                 False
NUMBER OF PERSONS INJURED     True
dtype: bool
#dtypes
all_dataC['CRASH TIME'] = pd.to_datetime(all_dataC['CRASH TIME'])
all_dataC['ZIP CODE'] = pd.to_numeric(all_dataC['ZIP CODE'], errors='coerce')
#rename
all_dataC = all_dataC.rename(columns={'CONTRIBUTING FACTOR VEHICLE 1': 'collision_factor'})
#add count column
all_dataC['count']=1
#extract the year from the crash datetime
all_dataC['year'] = all_dataC['CRASH TIME'].dt.year
#drop rows where the collision factor is null or unspecified
all_dataC = all_dataC[pd.notnull(all_dataC['collision_factor'])]
all_dataC = all_dataC.drop(all_dataC[all_dataC['collision_factor']=='Unspecified'].index)
#select columns
all_dataC = all_dataC[['year', 'BOROUGH', 'collision_factor', 'COLLISION_ID', 'NUMBER OF PERSONS INJURED']]
all_dataC.head()
| year | BOROUGH | collision_factor | COLLISION_ID | NUMBER OF PERSONS INJURED | |
|---|---|---|---|---|---|
| 0 | 2020 | BROOKLYN | Backing Unsafely | 4342387 | 0.0 |
| 1 | 2020 | NaN | Other Vehicular | 4351600 | 0.0 |
| 2 | 2020 | STATEN ISLAND | Driver Inattention/Distraction | 4346201 | 0.0 |
| 3 | 2020 | BROOKLYN | Driver Inattention/Distraction | 4353455 | 2.0 |
| 5 | 2020 | BRONX | Driver Inattention/Distraction | 4353187 | 0.0 |
#top collision factors by year, by borough
df_agg = all_dataC.groupby(['BOROUGH','year', 'collision_factor']).sum().reset_index()
# g = df_agg['count'].groupby(level=0, group_keys=False)
brooklyn = df_agg[df_agg['BOROUGH']=='BROOKLYN']
brooklyn.sort_values(by=['NUMBER OF PERSONS INJURED'], ascending=False).head()
| BOROUGH | year | collision_factor | COLLISION_ID | NUMBER OF PERSONS INJURED | |
|---|---|---|---|---|---|
| 70 | BROOKLYN | 2020 | Driver Inattention/Distraction | 196758140966 | 18539.0 |
| 77 | BROOKLYN | 2020 | Failure to Yield Right-of-Way | 73531192039 | 11732.0 |
| 111 | BROOKLYN | 2020 | Traffic Control Disregarded | 18755625494 | 4047.0 |
| 80 | BROOKLYN | 2020 | Following Too Closely | 44980979570 | 3701.0 |
| 78 | BROOKLYN | 2020 | Fatigued/Drowsy | 13791476136 | 2079.0 |
#top collision factors by year, by borough (Queens)
df_agg = all_dataC.groupby(['BOROUGH','year', 'collision_factor']).sum().reset_index()
queens = df_agg[df_agg['BOROUGH']=='QUEENS']
queens.sort_values(by=['NUMBER OF PERSONS INJURED'], ascending=False).head()
| BOROUGH | year | collision_factor | COLLISION_ID | NUMBER OF PERSONS INJURED | |
|---|---|---|---|---|---|
| 190 | QUEENS | 2020 | Driver Inattention/Distraction | 215701410831 | 18774.0 |
| 197 | QUEENS | 2020 | Failure to Yield Right-of-Way | 83004729961 | 11779.0 |
| 231 | QUEENS | 2020 | Traffic Control Disregarded | 18909454028 | 3771.0 |
| 200 | QUEENS | 2020 | Following Too Closely | 41587980115 | 3230.0 |
| 183 | QUEENS | 2020 | Alcohol Involvement | 10771525761 | 1555.0 |
#top collision factors by year, by borough (Manhattan)
df_agg = all_dataC.groupby(['BOROUGH','year', 'collision_factor']).sum().reset_index()
manhattan = df_agg[df_agg['BOROUGH']=='MANHATTAN']
manhattan.sort_values(by=['NUMBER OF PERSONS INJURED'], ascending=False).head()
| BOROUGH | year | collision_factor | COLLISION_ID | NUMBER OF PERSONS INJURED | |
|---|---|---|---|---|---|
| 130 | MANHATTAN | 2020 | Driver Inattention/Distraction | 170553523923 | 11176.0 |
| 137 | MANHATTAN | 2020 | Failure to Yield Right-of-Way | 34731568263 | 4742.0 |
| 151 | MANHATTAN | 2020 | Other Vehicular | 42141153322 | 1992.0 |
| 140 | MANHATTAN | 2020 | Following Too Closely | 34158055847 | 1687.0 |
| 171 | MANHATTAN | 2020 | Traffic Control Disregarded | 8187444816 | 1534.0 |
collisionsDF.head(2)
| CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | CROSS STREET NAME | OFF STREET NAME | ... | CONTRIBUTING FACTOR VEHICLE 2 | CONTRIBUTING FACTOR VEHICLE 3 | CONTRIBUTING FACTOR VEHICLE 4 | CONTRIBUTING FACTOR VEHICLE 5 | COLLISION_ID | VEHICLE TYPE CODE 1 | VEHICLE TYPE CODE 2 | VEHICLE TYPE CODE 3 | VEHICLE TYPE CODE 4 | VEHICLE TYPE CODE 5 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 08/18/2020 | 12:00 | BROOKLYN | 11214 | 40.60101 | -73.99352 | (40.60101, -73.99352) | NaN | NaN | 15 BAY 31 STREET | ... | Unspecified | NaN | NaN | NaN | 4342387 | Station Wagon/Sport Utility Vehicle | Station Wagon/Sport Utility Vehicle | NaN | NaN | NaN |
| 1 | 09/25/2020 | 0:01 | NaN | NaN | NaN | NaN | NaN | TRIBOROUGH BRIDGE | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4351600 | Station Wagon/Sport Utility Vehicle | NaN | NaN | NaN | NaN |
2 rows × 29 columns
# Assuming the dataframe is sorted by date and time (row 0 holds the latest date, the last row the earliest)
# This can be inferred from the head() and tail() results
oldest_date = collisionsDF.iloc[-1]['CRASH DATE'] #tail
newest_date = collisionsDF.iloc[0]['CRASH DATE'] #head
newest_date
'08/18/2020'
#Extract DateTime features
collisionsDF['collisions_datetime'] = pd.to_datetime(collisionsDF['CRASH DATE'], infer_datetime_format=True)
collisionsDF['collisions_time'] = pd.to_datetime(collisionsDF['CRASH TIME'], infer_datetime_format=True)
#collisionsDF.loc[:, 'collisions_datetime'] = collisionsDF['collisions_datetime'].dt.date
collisionsDF.loc[:, 'collisions_weekday'] = collisionsDF['collisions_datetime'].dt.weekday
collisionsDF.loc[:, 'collisions_weekofyear'] = collisionsDF['collisions_datetime'].dt.weekofyear
collisionsDF.loc[:, 'collisions_hour'] = collisionsDF['collisions_time'].dt.hour
collisionsDF.loc[:, 'collisions_minute'] = collisionsDF['collisions_time'].dt.minute
collisionsDF.loc[:, 'collisions_dt'] = (collisionsDF['collisions_datetime'] - collisionsDF['collisions_datetime'].min()).dt.total_seconds()
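One caveat: `Series.dt.weekofyear` is deprecated in recent pandas releases; `dt.isocalendar().week` is the replacement. A quick sketch, using dates that match the first two rows of the sample output below:

```python
import pandas as pd

# Two dates from the sample rows (weekofyear 34 and 39 in the table)
dates = pd.to_datetime(pd.Series(["2020-08-18", "2020-09-25"]))

# replacement for the deprecated .dt.weekofyear accessor
weeks = dates.dt.isocalendar().week
print(weeks.tolist())  # [34, 39]
```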
#collisionsDF['collisions_time'].dtype
collisionsDF.head()
| CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | CROSS STREET NAME | OFF STREET NAME | ... | VEHICLE TYPE CODE 3 | VEHICLE TYPE CODE 4 | VEHICLE TYPE CODE 5 | collisions_datetime | collisions_time | collisions_weekday | collisions_weekofyear | collisions_hour | collisions_minute | collisions_dt | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 08/18/2020 | 12:00 | BROOKLYN | 11214 | 40.601010 | -73.99352 | (40.60101, -73.99352) | NaN | NaN | 15 BAY 31 STREET | ... | NaN | NaN | NaN | 2020-08-18 | 2020-11-30 12:00:00 | 1 | 34 | 12 | 0 | 256608000.0 |
| 1 | 09/25/2020 | 0:01 | NaN | NaN | NaN | NaN | NaN | TRIBOROUGH BRIDGE | NaN | NaN | ... | NaN | NaN | NaN | 2020-09-25 | 2020-11-30 00:01:00 | 4 | 39 | 0 | 1 | 259891200.0 |
| 2 | 09/05/2020 | 19:25 | STATEN ISLAND | 10305 | 40.604397 | -74.06900 | (40.604397, -74.069) | FINGERBOARD ROAD | NARROWS ROAD NORTH | NaN | ... | Sedan | NaN | NaN | 2020-09-05 | 2020-11-30 19:25:00 | 5 | 36 | 19 | 25 | 258163200.0 |
| 3 | 09/30/2020 | 14:00 | BROOKLYN | 11235 | 40.586376 | -73.94790 | (40.586376, -73.9479) | VOORHIES AVENUE | EAST 21 STREET | NaN | ... | NaN | NaN | NaN | 2020-09-30 | 2020-11-30 14:00:00 | 2 | 40 | 14 | 0 | 260323200.0 |
| 4 | 08/19/2020 | 12:36 | BRONX | 10472 | 40.830738 | -73.86707 | (40.830738, -73.86707) | NaN | NaN | 1231 SAINT LAWRENCE AVENUE | ... | NaN | NaN | NaN | 2020-08-19 | 2020-11-30 12:36:00 | 2 | 34 | 12 | 36 | 256694400.0 |
5 rows × 36 columns
#Data Preparation for the weekly analysis
#get count of collisions every hour on every weekday.
sunday = collisionsDF[collisionsDF['collisions_weekday'] == 6]
df_sundayhourlytripcount = sunday.groupby('collisions_hour').count()
monday = collisionsDF[collisionsDF['collisions_weekday'] == 0]
df_mondayhourlytripcount = monday.groupby('collisions_hour').count()
tuesday = collisionsDF[collisionsDF['collisions_weekday'] == 1]
df_tuesdayhourlytripcount = tuesday.groupby('collisions_hour').count()
wednesday = collisionsDF[collisionsDF['collisions_weekday'] == 2]
df_wednesdayhourlytripcount = wednesday.groupby('collisions_hour').count()
thursday = collisionsDF[ collisionsDF['collisions_weekday'] == 3]
df_thursdayhourlytripcount = thursday.groupby('collisions_hour').count()
friday = collisionsDF[collisionsDF['collisions_weekday'] == 4]
df_fridayhourlytripcount = friday.groupby('collisions_hour').count()
saturday = collisionsDF[collisionsDF['collisions_weekday'] == 5]
df_saturdayhourlytripcount = saturday.groupby('collisions_hour').count()
collisions_hr_x = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]
sun_crashcounty = df_sundayhourlytripcount['CRASH TIME']
mon_crashcounty = df_mondayhourlytripcount['CRASH TIME']
tues_crashcounty = df_tuesdayhourlytripcount['CRASH TIME']
wed_crashcounty = df_wednesdayhourlytripcount['CRASH TIME']
thurs_crashcounty = df_thursdayhourlytripcount['CRASH TIME']
fri_crashcounty = df_fridayhourlytripcount['CRASH TIME']
sat_crashcounty = df_saturdayhourlytripcount['CRASH TIME']
df_sundayhourlytripcount.tail(1)
sun_crashcounty.tail(1)
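The seven filter-and-group blocks above can be collapsed: `pd.crosstab` builds the weekday-by-hour count table in one call. A minimal sketch on toy data (the `toy` frame is illustrative):

```python
import pandas as pd

# Toy frame: two Monday-8am rows, one Tuesday-5pm row, one Sunday-11pm row
toy = pd.DataFrame({
    "collisions_weekday": [0, 0, 1, 6],
    "collisions_hour": [8, 8, 17, 23],
})

# rows: weekday (0=Monday), columns: hour of day, values: collision counts
table = pd.crosstab(toy["collisions_weekday"], toy["collisions_hour"])
print(table.loc[0, 8])  # 2 collisions on Mondays at 8am in the toy data
```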
# Create traces
trace1 = go.Scatter(x = collisions_hr_x, y = sun_crashcounty, mode = "lines+markers", name = 'Sunday')
trace2 = go.Scatter(x = collisions_hr_x, y = mon_crashcounty, name = 'Monday')
trace3 = go.Scatter( x = collisions_hr_x, y = tues_crashcounty, name = 'Tuesday')
trace4 = go.Scatter( x = collisions_hr_x, y = wed_crashcounty, name = 'Wednesday')
trace5 = go.Scatter( x = collisions_hr_x, y = thurs_crashcounty, name = 'Thursday')
trace6 = go.Scatter( x = collisions_hr_x, y = fri_crashcounty, name = 'Friday')
trace7 = go.Scatter(x =collisions_hr_x, y = sat_crashcounty, mode = "lines+markers", name = 'Saturday')
layout = dict(title='Weekly Collisions by Hour',
              xaxis=dict(title='Hour of the day', ticklen=5, zeroline=False),
              yaxis=dict(title='Collisions quantity', ticklen=5, zeroline=False))
linedata = [trace1, trace2, trace3, trace4, trace5, trace6,trace7]
fig = dict(data=linedata, layout=layout)
py.iplot(fig, filename='timeline-lineplot')

#Preparing data for the graph
W_0 = collisionsDF[collisionsDF['collisions_weekday'] == 0].groupby('collisions_hour').count()
W_1 = collisionsDF[collisionsDF['collisions_weekday'] == 1].groupby('collisions_hour').count()
W_2 = collisionsDF[collisionsDF['collisions_weekday'] == 2].groupby('collisions_hour').count()
W_3 = collisionsDF[collisionsDF['collisions_weekday'] == 3].groupby('collisions_hour').count()
W_4 = collisionsDF[collisionsDF['collisions_weekday'] == 4].groupby('collisions_hour').count()
W_5 = collisionsDF[collisionsDF['collisions_weekday'] == 5].groupby('collisions_hour').count()
W_6 = collisionsDF[collisionsDF['collisions_weekday'] == 6].groupby('collisions_hour').count()
#W_0.head()
trace = go.Heatmap(
    z=[W_0['CRASH TIME'], W_1['CRASH TIME'], W_2['CRASH TIME'], W_3['CRASH TIME'],
       W_4['CRASH TIME'], W_5['CRASH TIME'], W_6['CRASH TIME']],
    y=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'],
    x=['Midnight','1am','2am','3am','4am','5am','6am','7am','8am','9am','10am','11am',
       'Noon','1pm','2pm','3pm','4pm','5pm','6pm','7pm','8pm','9pm','10pm','11pm'],
    colorscale='Reds', xgap=10, ygap=10)
layout = dict(title = 'Collisions per week per hour')
dataheat=[trace]
fig = dict(data = dataheat, layout=layout)
py.iplot(fig, filename='labelled-heatmap')
#Takeaway: Friday's rush hour peaks around 5pm, which may be impacting trip durations around that time. It could be
#interesting to know which street name has the most collisions in that hour.

collisionsDF.groupby('ON STREET NAME').count().sort_values(by='NUMBER OF PERSONS INJURED', ascending=False).head(10)['NUMBER OF PERSONS INJURED']
ON STREET NAME
BROADWAY                      16994
ATLANTIC AVENUE               15061
BELT PARKWAY                  12895
3 AVENUE                      12285
NORTHERN BOULEVARD            11831
FLATBUSH AVENUE                9608
LONG ISLAND EXPRESSWAY         9558
BROOKLYN QUEENS EXPRESSWAY     9429
LINDEN BOULEVARD               9376
QUEENS BOULEVARD               9284
Name: NUMBER OF PERSONS INJURED, dtype: int64
TOP5 = collisionsDF.groupby('ON STREET NAME').count().sort_values(by='NUMBER OF PERSONS INJURED', ascending=False).head(5)['NUMBER OF PERSONS INJURED']
TOP5.keys()
Index(['BROADWAY ', 'ATLANTIC AVENUE ',
'BELT PARKWAY ', '3 AVENUE ',
'NORTHERN BOULEVARD '],
dtype='object', name='ON STREET NAME')
TOP5.keys()[4]
'NORTHERN BOULEVARD '
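Note the trailing whitespace in these street-name keys (e.g. `'NORTHERN BOULEVARD '`). If the same street appeared with different padding, it would group under separate keys; a sketch (on toy data) of normalizing with `str.strip()` before grouping:

```python
import pandas as pd

# Toy frame: the same street name with three different paddings
toy = pd.DataFrame({"ON STREET NAME": ["BROADWAY   ", "BROADWAY", " BROADWAY "]})

# strip leading/trailing whitespace so identical streets collapse to one key
toy["ON STREET NAME"] = toy["ON STREET NAME"].str.strip()
print(toy["ON STREET NAME"].nunique())  # 1
```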
#Lets look at demand per hour.
bar0data = collisionsDF[collisionsDF['ON STREET NAME']==TOP5.keys()[0]].groupby(['collisions_hour']).count().reset_index()
bar1data = collisionsDF[collisionsDF['ON STREET NAME']==TOP5.keys()[1]].groupby(['collisions_hour']).count().reset_index()
bar2data = collisionsDF[collisionsDF['ON STREET NAME']==TOP5.keys()[2]].groupby(['collisions_hour']).count().reset_index()
bar3data = collisionsDF[collisionsDF['ON STREET NAME']==TOP5.keys()[3]].groupby(['collisions_hour']).count().reset_index()
bar4data = collisionsDF[collisionsDF['ON STREET NAME']==TOP5.keys()[4]].groupby(['collisions_hour']).count().reset_index()
hours = ['Midnight','1am','2am','3am','4am','5am','6am','7am','8am','9am','10am','11am',
         'Noon','1pm','2pm','3pm','4pm','5pm','6pm','7pm','8pm','9pm','10pm','11pm']
trace0 = go.Bar(x=hours, y=bar0data['NUMBER OF PERSONS INJURED'],
                name='BROADWAY', marker=dict(color='rgb(49,130,189)'))
trace1 = go.Bar(x=hours, y=bar1data['NUMBER OF PERSONS INJURED'],
                name='ATLANTIC AVENUE', marker=dict(color='rgb(255,192,203)'))
trace2 = go.Bar(x=hours, y=bar2data['NUMBER OF PERSONS INJURED'],
                name='BELT PARKWAY', marker=dict(color='rgb(138,43,226)'))
trace3 = go.Bar(x=hours, y=bar3data['NUMBER OF PERSONS INJURED'],
                name='3 Avenue', marker=dict(color='rgb(255,215,0)'))
trace4 = go.Bar(x=hours, y=bar4data['NUMBER OF PERSONS INJURED'],
                name='NORTHERN BOULEVARD', marker=dict(color='rgb(255,0,0)'))
datax = [trace0, trace1, trace2, trace3, trace4]
layout = go.Layout(xaxis=dict(tickangle=-45),barmode='group')
fig = go.Figure(data=datax, layout=layout)
py.iplot(fig)

The rush hour at Broadway, Atlantic Avenue, and 3 Avenue is 4pm, while at Northern Boulevard it is 5pm. Overall, 4pm to 6pm is the window with the highest risk of collision.
Since driver inattention/distraction is the top collision factor year over year, and most accidents occur on weekdays during business hours, I propose a data-driven public awareness campaign focused on reducing driver distractions.
On April 9th, 2019, the Governor's Traffic Safety Committee (GTSC) launched a Distracted Driving PSA Campaign in NYC. The campaign ran on billboards, television, and radio.